(ICCV 2017) StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

Posted on 2018-05-30 in Paper Note, Cross-Modality, Image-Text

Keyword [StackGAN]

Zhang H, Xu T, Li H, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks[C]//IEEE Int. Conf. Comput. Vision (ICCV). 2017: 5907-5915.

1. Overview

1.1. Motivation

Existing text-to-image methods fail to capture fine details and vivid object parts.
GAN training is unstable.
The limited number of training text-image pairs leaves the text conditioning manifold sparse, and this sparsity makes the GAN difficult to train.

The paper proposes StackGAN, which relies on two ideas.

Decompose the hard problem into more manageable sub-problems:
- Stage-I. sketch the primitive shape and colors of the object at low resolution
- Stage-II. add details on top of the Stage-I result

Conditioning Augmentation technique. Encourages smoothness in the latent conditioning manifold.

1.2. Contribution

StackGAN
Conditioning Augmentation (CA)

1.3. Related Work

1.3.1. Generative Model

VAE
Pixel RNN
GAN
energy-based GAN

1.3.2. Conditional Image Generation

Conditioned on simple variables such as attributes or class labels.
Conditioned on images (image-to-image). Photo editing, domain transfer, super-resolution (SR).

1.3.3. Series of GAN

2. StackGAN

2.1. Conditioning Augmentation

The latent space of the text embedding is usually high-dimensional, so a limited amount of data causes discontinuity in the latent data manifold.
CA yields more training pairs, encourages smoothness over the conditioning manifold, and helps avoid overfitting.
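
A minimal sketch of how CA can be implemented, assuming the usual reparameterization trick: a linear layer maps the text embedding phi_t to a mean and a diagonal variance, and the conditioning vector c_hat is sampled from that Gaussian. The function name, the log-variance parameterization, and all sizes below are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def conditioning_augmentation(phi_t, w_mu, b_mu, w_logvar, b_logvar, rng):
    """Sample c_hat ~ N(mu(phi_t), diag(sigma(phi_t)^2)) via reparameterization."""
    mu = phi_t @ w_mu + b_mu              # mean of the conditioning Gaussian
    logvar = phi_t @ w_logvar + b_logvar  # log of the diagonal variance (assumed parameterization)
    sigma = np.exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)
    c_hat = mu + sigma * eps
    # KL(N(mu, diag(sigma^2)) || N(0, I)): summed over dimensions, averaged over the batch
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - logvar, axis=1).mean()
    return c_hat, kl

rng = np.random.default_rng(0)
embed_dim, cond_dim, batch = 1024, 128, 4            # illustrative sizes only
phi_t = rng.standard_normal((batch, embed_dim))      # stand-in for real text embeddings
w_mu = rng.standard_normal((embed_dim, cond_dim)) * 0.01
w_logvar = np.zeros((embed_dim, cond_dim))
b = np.zeros(cond_dim)

# The same text embedding gives a different c_hat on every call.
c1, kl1 = conditioning_augmentation(phi_t, w_mu, b, w_logvar, b, rng)
c2, kl2 = conditioning_augmentation(phi_t, w_mu, b, w_logvar, b, rng)
```

Because each call draws a fresh eps, one caption produces many nearby conditioning vectors, which is where the "more training pairs" effect comes from. The KL term keeps the sampled distribution close to a standard Gaussian.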

2.2. Stage-I GAN

λ is set to 1 (the weight of the KL regularization term in the generator loss).
I_0. the real image
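
For reference, the Stage-I objectives as I recall them from the paper, where D_0 maximizes the first expression and G_0 minimizes the second. The symbols \varphi_t (text embedding), \hat{c}_0 (CA sample), and \mu_0, \Sigma_0 (CA mean and covariance) are not defined in these notes, so check them against the paper.

```latex
\begin{aligned}
\mathcal{L}_{D_0} &= \mathbb{E}_{(I_0, t) \sim p_{data}} \left[ \log D_0(I_0, \varphi_t) \right]
  + \mathbb{E}_{z \sim p_z,\, t \sim p_{data}} \left[ \log\left(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\right) \right] \\
\mathcal{L}_{G_0} &= \mathbb{E}_{z \sim p_z,\, t \sim p_{data}} \left[ \log\left(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\right) \right]
  + \lambda\, D_{KL}\left( \mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t)) \,\|\, \mathcal{N}(0, I) \right)
\end{aligned}
```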

2.3. Stage-II GAN

s_0. the low-resolution image generated by Stage-I
The two stages share the same text encoder but use different CA layers.
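
The Stage-II objectives, again as I recall them from the paper, have the same form as Stage-I except that the generator is conditioned on the Stage-I result s_0 instead of random noise. The symbols \varphi_t, \hat{c}, \mu, and \Sigma are not defined in these notes and should be checked against the paper.

```latex
\begin{aligned}
\mathcal{L}_{D} &= \mathbb{E}_{(I, t) \sim p_{data}} \left[ \log D(I, \varphi_t) \right]
  + \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}} \left[ \log\left(1 - D(G(s_0, \hat{c}), \varphi_t)\right) \right] \\
\mathcal{L}_{G} &= \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}} \left[ \log\left(1 - D(G(s_0, \hat{c}), \varphi_t)\right) \right]
  + \lambda\, D_{KL}\left( \mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I) \right)
\end{aligned}
```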

2.4. Details

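First train the Stage-I GAN while keeping the Stage-II GAN fixed.
Then train the Stage-II GAN while keeping the Stage-I GAN fixed.
Adam optimizer with learning rate 0.0002 and decay factor 0.5; mini-batch size 64.
Nearest-neighbour upsampling in the generators.
The noise vector z has dimension 100.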
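
A small sketch of the settings above. The 100-epoch decay interval is my recollection of the paper and is treated as an assumption here. The helper names are hypothetical, and the upsampling is plain NumPy rather than any framework layer.

```python
import numpy as np

def learning_rate(epoch, base_lr=2e-4, decay=0.5, decay_every=100):
    """Step decay: multiply base_lr by `decay` once per `decay_every` epochs (interval assumed)."""
    return base_lr * decay ** (epoch // decay_every)

def upsample_nearest(x, scale=2):
    """Nearest-neighbour upsampling of an (H, W, C) array by an integer factor."""
    return np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)

rng = np.random.default_rng(0)
z = rng.standard_normal((64, 100))   # one mini-batch of 64 noise vectors, each of dimension 100

feat = np.arange(4, dtype=np.float32).reshape(2, 2, 1)   # toy 2x2 feature map
up = upsample_nearest(feat)                              # 4x4, each value copied into a 2x2 block
```

In a generator, each upsampling step like this would be followed by a convolution. Only the upsampling step itself is shown.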

3. Experiments

3.1. Dataset

MSCOCO
CUB

3.2. Metric

Inception Score

x. a generated sample
y. the label predicted by the Inception model (fine-tuned on the experiment dataset)
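
With x and y as defined above, the Inception Score is IS = exp(E_x[KL(p(y|x) || p(y))]). A minimal sketch of the computation, assuming the classifier's softmax outputs are already available as an (N, K) array. The function name and the eps smoothing are my own choices, not from the paper.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array whose row i is p(y | x_i) from the classifier."""
    p_y = probs.mean(axis=0, keepdims=True)   # marginal label distribution p(y)
    # Per-sample KL(p(y|x) || p(y)); eps guards against log(0)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Two toy extremes with K = 5 classes
uniform = np.full((10, 5), 0.2)   # every sample is ambiguous: score near 1
onehot = np.eye(5)                # confident and evenly spread over classes: score near 5
score_uniform = inception_score(uniform)
score_onehot = inception_score(onehot)
```

The score is high when each sample gets a confident prediction (low-entropy p(y|x)) and the predictions cover many classes (high-entropy p(y)). Its range is 1 to K.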

Human Evaluation

3.3. Comparison

GAN-INT-CLS. Generated images only reflect the general shape and color of the birds.
GAWWN. Fails to generate plausible images.

Stage-II GAN can correct the defects of the Stage-I results.
Even when Stage-I fails to draw a plausible shape, Stage-II can still generate a reasonable object.

3.4. Ablation Study

CA helps stabilize training and improves the diversity of generated samples, because it encourages robustness to small perturbations along the latent manifold.

3.5. Interpolation
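
A minimal sketch of what this experiment involves, assuming linear interpolation between two sentence embeddings, with each interpolated embedding then fed to the generator. The function name and sizes are hypothetical, and the embeddings are random stand-ins.

```python
import numpy as np

def interpolate_embeddings(phi_a, phi_b, steps=8):
    """Return `steps` embeddings moving linearly from phi_a to phi_b."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]   # interpolation weights, shape (steps, 1)
    return (1.0 - alphas) * phi_a + alphas * phi_b

rng = np.random.default_rng(0)
phi_a = rng.standard_normal(1024)   # stand-in embedding of the first sentence
phi_b = rng.standard_normal(1024)   # stand-in embedding of the second sentence
path = interpolate_embeddings(phi_a, phi_b)
```

If the conditioning manifold is smooth, images generated along `path` should change gradually rather than jump between modes.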